Journal of Computational Biology — Latest Matching Preprints

1

Algebraic invariants for inferring 4-leaf semi-directed phylogenetic networks

Martin, S.; Moulton, V.; Leggett, R. M.

2023-09-14 evolutionary biology 10.1101/2023.09.11.557152 medRxiv

Top 0.1%

22.0%

Show abstract

A core goal of phylogenomics is to determine the evolutionary history of a set of species from biological sequence data. Phylogenetic networks are able to describe more complex evolutionary phenomena than phylogenetic trees but are more difficult to accurately reconstruct. Recently, there has been growing interest in developing methods to infer semi-directed phylogenetic networks. As computing such networks can be computationally intensive, one approach to building such networks is to puzzle together smaller networks. Thus, it is essential to have robust methods for inferring semi-directed phylogenetic networks on small numbers of taxa. In this paper, we investigate an algebraic method for performing phylogenetic network inference from nucleotide sequence data on 4-leaf semi-directed phylogenetic networks by analysing the distribution of leaf-pattern probabilities. On simulated data, we found that we can correctly identify with high accuracy the undirected phylogenetic network for sequences of length at least 10kbp. We found that identifying the semi-directed network is more challenging and requires sequences of length approaching 10Mbp. We are also able to use our approach to identify tree-like evolution and determine the underlying tree. Finally, we employ our method on a real dataset from Xiphophorus species and use the results to build a phylogenetic network.

2

Weighted Centroid Trees: A general approach for summarizing phylogenies in tumor mutation tree inference

Vasei, H.; Foroughmand-Araabi, M.-H.; Daneshgar, A.

2023-09-15 bioinformatics 10.1101/2023.09.11.557167 medRxiv

Top 0.1%

18.6%

Show abstract

Tumor mutation trees are the primary tools to model the evolution of cancer. Not only some tumor phylogeny inference methods may produce a set of trees having potential and parallel evolutionary histories, but also mutation trees from different patients may also exhibit similar evolutionary processes. When a set of correlated mutation trees is available, compressing the data into a single best-fit tree, exhibiting the shared evolutionary processes, is definitely of great importance and can be beneficial in many applications. In this study, we present a general setup to study and analyse the problem of finding a best-fit (centroid) tree to a given set of trees and we use our general setup to analyse mutation trees as our main motivation. For this let{varepsilon} :[T] n [->] [R]nxn be an embedding of labeled rooted trees into the space of real square matrices and also let L be a norm on this space. We introduce the nearest mapped tree problem as the problem of finding a closest tree to a given matrix A with respect to{varepsilon} and L, i.e., a tree T *(A) for which L({varepsilon}(T *(A)) - A) is minimized. Within this setup, our potential candidates for the embedding are adjacency, ancestry, and distance matrices of trees, where we consider the cases of L1 and L2 norms in our analysis. We show that the function d(T1, T2) = L({varepsilon}(T1) -{varepsilon} (T2)) defines a family of dissimilarity measures, covering previously studied parent-child and ancestor-descendent metrics. Also, we show that the nearest mapped tree problem is polynomial-time solvable for the adjacency matrix embedding and is[N][P] -hard for the ancestry and the distance embeddings. The weighted centroid tree problem for a given set of trees of size k is naturally defined as a nearest mapped tree solution to a weighted sum of the corresponding matrix set. In this article we consider uniform weighted-sums for which all weights are equal, where in particular, the (classical) centroid tree is defined to be a solution when all weights are chosen to be equal to 1/k (i.e., the mean case). Similarly, the{omega} -weighted centroid tree is a solution when all weights are equal to{omega} /k. To show the generality of our setup, we prove that the solution-set of the centroid tree problem for the adjacency and the ancestry matrices are identical to the solution-set of the consensus tree problem for parent-child and ancestor-descendent distances already handled by the algorithms GraPhyC(2018) and TuELiP(2023), respectively. Next, to tackle this problem for some new cases, we provide integer linear programs to handle the nearest mapped tree problem for the ancestry and the distance embeddings, giving rise to solutions of the weighted centroid tree problem in these cases. To show the effectiveness of this approach, we provide an algorithm, WAncILP2, to solvethe 2-weighted centroid tree problem for the case of the ancestry matrix and we justify the importance of the weighted setup by showing the pioneering performance of WAncILP2 both in a comprehensive simulation analysis as well as on a real breast cancer dataset, in which, by finding the centroids as representatives of data clusters, we provide supporting evidence for the fact that some common aspects of these centroids can be considered as suitable candidates for reliable evolutionary information in relation to the original data. metrics.

3

The Most Parsimonious Reconciliation Problem in the Presence of Incomplete Lineage Sorting and Hybridization is NP-Hard

LeMay, M.; Wu, Y.-C.; Libeskind-Hadas, R.

2021-03-16 evolutionary biology 10.1101/2021.03.14.435321 medRxiv

Top 0.1%

18.5%

Show abstract

The maximum parsimony phylogenetic reconciliation problem seeks to explain incongruity between a gene phylogeny and a species phylogeny with respect to a set of evolutionary events. While the reconciliation problem is well-studied for species and gene trees subject to events such as duplication, transfer, loss, and deep coalescence, recent work has examined species phylogenies that incorporate hybridization and are thus represented by networks rather than trees. In this paper, we show that the problem of computing a maximum parsimony reconciliation for a gene tree and species network is NP-hard even when only considering deep coalescence. This result suggests that future work on maximum parsimony reconciliation for species networks should explore approximation algorithms and heuristics.

4

EssentCell: Discovering Essential Evolutionary Relations in Noisy Single-Cell Data

Liyanage, A.; Burger, R.; Shi, A.; Sopp, B.; Zhu, B.; Mumey, B.

2025-04-18 cancer biology 10.1101/2025.04.12.648524 medRxiv

Top 0.1%

18.3%

Show abstract

Single-cell sequencing (SCS) enables investigating tumor evolution at a single cell resolution. A common type of analysis to investigate evolutionary structure from an SCS experiment is to determine a phylogenetic tree structure from the data. This problem has been well-studied under the assumption that mutations only accumulate in the evolution of cancer and there is a simple characterization of when the data is compatible with a perfect phylogeny based on the absence of a special "conflict" submatrix. SCS data can be represented as a binary matrix, where the ij-th entry indicates whether cell i has mutation j. In practice, SCS data is noisy, so a natural question is what is the minimum number of entries to flip in the data matrix, in order that the matrix becomes "conflict-free" and thus compatible with a perfect phylogeny. Furthermore, the false positive rate is orders of magnitude smaller than the false negative rate, so that at most a few false positives occur with high probability. We consider a variation of the minimum-flip problem in which the number of false positives in the solution is a parameter. Often, there can be multiple optimal solutions, so a natural question is what relations are true among all optimal solutions for a small range of possible false positives values; we call such relations essential. In this work, we propose an efficient algorithm based on integer linear programming to determine all essential relations in the data. We test our software tool, EssentCell, on several data sets and discuss the results found.

5

The De Bruijn Mapping Problem with Changes in the Graph

B. Rocha, L.; Adi, S. S.; Araujo, E.

2024-02-15 bioinformatics 10.1101/2024.02.15.580401 medRxiv

Top 0.1%

17.3%

Show abstract

In computational biology, mapping a sequence s onto a sequence graph G poses a significant challenge. One possible approach to tackling this problem is to find a walk p in G that spells a sequence most similar to s. This challenge is formally known as the Graph Sequence Mapping Problem (GSMP). In this paper, we delve into an alternative problem formulation known as the De Bruijn Graph Sequence Mapping Problem (BSMP). Both problems have three variants: changes only in the sequence, changes in the graph, and changes in both the sequence and the graph. We concentrate on addressing the variant involving changes in the graph. In the literature, when this problem does not allow the De Bruijn graph to induce new arcs after changes, it becomes NP-complete, as proven by Gibney et. al [4]. However, we reformulate the problem by considering the characteristics of the arcs induced in the De Bruijn graph. This reformulation alters the problem definition, thereby enabling the application of a polynomial-time algorithm for its resolution. Approaching the problem with this arc-inducing characteristic is new, and the algorithm proposed in this work is new in the literature.

6

A new clustering method for building multiple trees using deep learning.

Tahiri, N.

2019-10-04 evolutionary biology 10.1101/781252 medRxiv

Top 0.1%

14.5%

Show abstract

Each gene has its own evolutionary history which can substantially differ from the evolutionary histories of other genes. For example, some individual genes or operons can be affected by specific horizontal gene transfer or hybridization events. Thus, the evolutionary history of each gene should be represented by its own phylogenetic tree which may display different evolutionary patterns from the species tree, or Tree of Life, that represents the main patterns of vertical descent. Here, we present a new efficient method for inferring single or multiple consensus trees and supertrees for a given set of phylogenetic trees (i.e. additive trees or X-trees). The output of the traditional tree consensus methods is a unique consensus tree or supertree. Here, we show how Machine Learning (ML) models, based on some interesting properties of the Robinson and Foulds topological distance, can be used to partition a given set of trees into one (when the data are homogeneous) or multiple (when the data are heterogeneous) cluster(s) of trees. We adapt the popular Accuracy, Precision, Sensitivity, and F1 scores to the tree clustering. A special attention is paid to the relevant, but very challenging, problem of inferring alternative supertrees that are built from phylogenies defined on different, but mutually overlapping, sets of species. The use of an approximate objective function in clustering makes the new method faster than the existing tree clustering techniques and thus suitable for the analysis of large genomic datasets.

7

A Divide-and-Conquer Approach to Large-Scale Evolutionary Analysis of Single-Cell DNA Data

Liu, Y.; Nakhleh, L.

2025-03-17 cancer biology 10.1101/2024.04.28.591536 medRxiv

Top 0.1%

14.3%

Show abstract

Single-cell sequencing technology is producing large datasets, often containing thousands or even tens of thousands of single-cell genomic data points from an individual patient. Evolutionary analyses of these data sets help uncover and order genetic variants in the data as well as elucidate mutation trees and intra-tumor heterogeneity (ITH) in the case of cancer data sets. To enable such large-scale analyses computationally, we propose a divide-and-conquer approach that could be used to scale up computationally intensive inference methods. The approach consists of four steps: 1) partitioning the dataset into subsets, 2) constructing a rooted tree for each subset, 3) computing a representative genotype for each subset by utilizing its inferred tree, and 4) assembling the individual trees using a tree built on the representative genotypes. Besides its flexibility and enabling scalability, this approach also lends itself naturally to ITH analysis, as the clones would be the individual subsets, and the "assembly tree" could be the mutation tree that defines the clones. To demonstrate the effectiveness of our proposed approach, we conducted experiments employing a range of methods at each stage. In particular, as clustering and dimensionality reduction methods are commonly used to tame the complexity of large datasets in this area, we analyzed the performance of a variety of such methods within our approach.

8

Identifying Robust Subclonal Structures through Tumor Progression Tree Alignment

Gilbert, J.; Wu, C. H.; Knittel, H.; Schäffer, A. A.; Malikic, S.; Sahinalp, C.

2026-02-27 cancer biology 10.64898/2026.02.25.708046 medRxiv

Top 0.1%

12.4%

Show abstract

Understanding and comparing tumor evolutionary histories is fundamental to cancer genomics. Clonal trees, used to model tumor progression, are rooted, unordered trees in which each node represents a subclone labeled by a set of distinct mutations. To compare two clonal trees, we introduce omlta, the optimal multi-label tree alignment, which removes the minimum number of mutation labels from the trees, so that the remaining trees are isomorphic. Computing omlta is NP-hard. Here, we present an algorithm to compute the omlta, with a running time of [Formula] where L [≥] 1 is the total number of mutation labels occurring in the input trees and k is the minimum possible number of mutation labels that need to be removed for the alignment. Our implementation (https://github.com/algo-cancer/omlta) is the first computational tool for determining the optimal alignment between clonal trees. We applied omlta to 126 cases from the TRACERx study on non-small cell lung cancers and some melanoma single-cell data.

9

Fast and Accurate Species Trees from Weighted Internode Distances

Liu, B.; Warnow, T.

2022-05-26 evolutionary biology 10.1101/2022.05.24.493312 medRxiv

Top 0.1%

12.1%

Show abstract

Species tree estimation is a basic step in many biological research projects, but is complicated by the fact that gene trees can differ from the species tree due to processes such as incomplete lineage sorting (ILS), gene duplication and loss (GDL), and horizontal gene transfer (HGT), which can cause different regions within the genome to have different evolutionary histories (i.e., "gene tree heterogeneity"). One approach to estimating species trees in the presence of gene tree heterogeneity resulting from ILS operates by computing trees on each genomic region (i.e., computing "gene trees") and then using these gene trees to define a matrix of average internode distances, where the internode distance in a tree T between two species x and y is the number of nodes in T between the leaves corresponding to x and y. Given such a matrix, a tree can then be computed using methods such as neighbor joining. Methods such as ASTRID and NJst (which use this basic approach) are provably statistically consistent, very fast (low degree polynomial time) and have had high accuracy under many conditions that makes them competitive with other popular species tree estimation methods. In this study, inspired by the very recent work of weighted ASTRAL, we present weighted ASTRID, a variant of ASTRID that takes the branch uncertainty on the gene trees into account in the internode distance. Our experimental study evaluating weighted ASTRID shows improvements in accuracy compared to the original (unweighted) ASTRID while remaining fast. Moreover, weighted ASTRID shows competitive accuracy against weighted ASTRAL, the state of the art. Thus, this study provides a new and very fast method for species tree estimation that improves upon ASTRID, has comparable accuracy with the state of the art while remaining much faster. Weighted ASTRID is available at https://github.com/RuneBlaze/internode.

10

The Homo-Edit Distance Problem

Brand, M.; Tran, N. K.; Spohr, P.; Schrinner, S.; Klau, G. W.

2020-06-03 bioinformatics 10.1101/2020.05.27.118273 medRxiv

Top 0.1%

10.5%

Show abstract

We consider the homo-edit distance problem, which is the minimum number of homo-deletions or homo-insertions to convert one string into another. A homo-insertion is the insertion of a string of equal characters into another string, while a homo-deletion is the inverse operation. We show how to compute the homo-edit distance of two strings in polynomial time: We first demonstrate that the problem is equivalent to computing a common subsequence of the two input strings with a minimum number of homo-deletions and then present a dynamic programming solution for the reformulated problem. 2012 ACM Subject ClassificationApplied computing [->] Bioinformatics; Applied computing [->] Molecular sequence analysis; Theory of computation [->] Dynamic programming

11

Heuristics for the De Bruijn Graph Sequence Mapping Problem

Rocha, L. B.; Adi, S. S.; Araujo, E.

2023-02-07 bioinformatics 10.1101/2023.02.05.527069 medRxiv

Top 0.1%

10.4%

Show abstract

In computational biology, mapping a sequence s onto a sequence graph G is a significant challenge. One possible approach to addressing this problem is to identify a walk p in G that spells a sequence which is most similar to s. This problem is known as the Graph Sequence Mapping Problem (GSMP). In this paper, we study an alternative problem formulation, namely the De Bruijn Graph Sequence Mapping Problem (BSMP), which can be stated as follows: given a sequence s and a De Bruijn graph Gk (where k[≥] 2), find a walk p in Gk that spells a sequence which is most similar to s according to a distance metric. We present both exact algorithms and approximate distance heuristics for solving this problem, using edit distance as a criterion for measuring similarity.

12

Rapidly Computing the Phylogenetic Transfer Index

Truszkowski, J. M.; Gascuel, O.; Swenson, K.

2019-08-22 bioinformatics 10.1101/743948 medRxiv

Top 0.1%

10.2%

Show abstract

Given trees T and T* on the same taxon set, the transfer index {phi}(b, T*) is the number of taxa that need to be ignored so that the bipartition induced by branch b in T is equal to some bipartition in T*. Recently, Lemoine et al. [14] used the transfer index to design a novel bootstrap analysis technique that improves on Felsensteins bootstrap on large, noisy data sets. In this work, we propose an algorithm that computes the transfer index for all branches b [isin] T in O(n log3 n) time, which improves upon the current O(n2)-time algorithm by Lin, Rajan and Moret [15]. Our implementation is able to process pairs of trees with hundreds of thousands of taxa in minutes and considerably speeds up the method of Lemoine et al. on large data sets. We believe our algorithm can be useful for comparing large phylogenies, especially when some taxa are misplaced (e.g. due to horizontal gene transfer, recombination, or reconstruction errors).

13

Alignment Free Phylogeny Construction Using Maximum Likelihood Using k-mer Counts

Rahman, A. T. M. M.; Habib, S.; Islam, M. M.; Rahman, K. M.; Rahman, A.

2023-12-07 evolutionary biology 10.1101/2023.12.05.570306 medRxiv

Top 0.1%

10.2%

Show abstract

Estimating phylogenetic trees from molecular data often involves first performing a multiple sequence alignment of the sequences and then identifying the tree that maximizes likelihood computed under a model of nucleotide substitution. However, sequence alignment is computationally challenging for long sequences, especially in the presence of genomic rearrangements. To address this, methods for constructing phylogenetic trees without aligning the sequences i.e. alignment-free methods have been proposed. They are generally fast and can be used to construct phylogenetic trees of a large number of species but they primarily estimate phylogenies by computing pairwise distances and are not based on statistical models of molecular evolution. In this paper, we introduce a model for k-mer frequency change based on a birth-death-migration process which can be used to estimate maximum likelihood phylogenies from frequencies of k-mers in genomic sequences of species in an alignment-free approach. Experiments on real and simulated data demonstrate the efficacy of the model for likelihood based alignment-free phylogeny construction.

14

Prefix Block-Interchanges on Binary and Ternary Strings

Rahman, M. K.; Rahman, M. S.

2019-06-04 bioinformatics 10.1101/659664 medRxiv

Top 0.1%

10.2%

Show abstract

The genome rearrangement problem computes the minimum number of operations that are required to sort all elements of a permutation. A block-interchange operation exchanges two blocks of a permutation which are not necessarily adjacent and in a prefix block-interchange, one block is always the prefix of that permutation. In this paper, we focus on applying prefix block-interchanges on binary and ternary strings. We present upper bounds to group and sort a given binary/ternary string. We also provide upper bounds for a different version of the block-interchange operation which we refer to as the restricted prefix block-interchange. We observe that our obtained upper bound for restricted prefix block-interchange operations on binary strings is better than that of other genome rearrangement operations to group fully normalized binary strings. Consequently, we provide a linear-time algorithm to solve the problem of grouping binary normalized strings by restricted prefix block-interchanges. We also provide a polynomial time algorithm to group normalized ternary strings by prefix block-interchange operations. Finally, we provide a classification for ternary strings based on the required number of prefix block-interchange operations.

15

Comparing Methods for Species Tree Estimation With Gene Duplication and Loss

Willson, J.; Roddur, M.; Warnow, T.

2021-02-07 evolutionary biology 10.1101/2021.02.05.429947 medRxiv

Top 0.1%

9.8%

Show abstract

Species tree inference from gene trees is an important part of biological research. One confounding factor in estimating species trees is gene duplication and loss which can lead to gene trees with multiple copies of the same gene. In recent years there have been several new methods developed to address this problem that have substantially improved on earlier methods; however, the best performing methods (ASTRAL-Pro, ASTRID-multi, and FastMulRFS) have not yet been directly compared. In this study, we compare ASTRAL-Pro, ASTRID-multi, and FastMulRFS under a wide variety of conditions. Our study shows that while all three have very good accuracy, nearly the same under many conditions, ASTRAL-Pro and ASTRID-multi are more reliably accurate than FastMuLRFS, and that ASTRID-multi is often faster than ASTRAL-Pro. The datasets generated for this study are freely available in the Illinois Data Bank at https://databank.illinois.edu/datasets/IDB-2418574

16

Efficient Agony Based Transfer Learning Algorithms for Survival Forecasting

Tamaskar, A.; Bannon, J.; Mishra, B.

2021-02-25 cancer biology 10.1101/2021.02.24.432695 medRxiv

Top 0.1%

9.0%

Show abstract

Progression modeling is a mature subfield of cancer bioinformatics, but it has yet to make a proportional clinical impact. The majority of the research in this area has focused on the development of efficient algorithms for accurately reconstructing sequences of (epi)genomic events from noisy data. We see this as the first step in a broad pipeline that will translate progression modeling to clinical utility, with the subsequent steps involving inferring prognoses and optimal therapy programs for different cancers and using similarity in progression to enhance decision making. In this paper we take some initial steps in completing this pipeline. As a theoretical contribution, we introduce a polytime-computable pairwise distance between progression models based on the graph-theoretic notion of "agony". Focusing on a particular progression model we can then use this agony distance to cluster (dis)similarities via multi-dimensional scaling. We recover known biological similarities and dissimilarities. Finally, we use the agony distance to automate transfer learning experiments and show a large improvement in the ability to forecast time to death.

17

Evolutionary Distance of Gene-Gene Interactions: Estimation under Statistical Uncertainty

Gu, X.

2020-03-09 evolutionary biology 10.1101/2020.03.08.982710 medRxiv

Top 0.1%

8.6%

Show abstract

Consider the functional interaction of gene A to an interaction subject X; for instance, it is the gene-gene interaction if X represents for a gene, or gene-tissue interaction (expression status) if X for a tissue. In the simplest case, the status of this A-X interaction is r=1 if they are interacted, or r=0 otherwise. A fundamental problem in molecular evolution is, given two homologous (orthologous or paralogous) genes A and B, to what extent their functional overlapping could be by the means of interaction networks. Given a set of interaction subjects (X1, ... XN), it is straightforward to calculate the interaction distance (IAB) between genes A and B, by a Markov-chain model. However, since the high throughput interaction data always involve a high level of noises, reliable inference of r=1 or r=0 for each gene remains a big challenge. Consequently, the estimated interaction distance (IAB) is highly sensitive to the cutoff of interaction inference which is subject to some arbitrary. In this paper we will address this issue by developing a statistical method for estimating IAB based on the p-values (significant levels). Computer simulations are carried out to evaluate the performance of different p-value transformations against the uncertainty of interaction networks.

18

Fast computation of principal components of genomic similarity matrices

Hahn, G.; Lutz, S.; Hecker, J.; Prokopenko, D.; Cho, M.; Silverman, E. K.; Weiss, S. T.; Lange, C.

2022-10-08 bioinformatics 10.1101/2022.10.06.511168 medRxiv

Top 0.1%

8.6%

Show abstract

The computation of a similarity measure for genomic data, for instance using the (genomic) covariance matrix, the Jaccard matrix, or the genomic relationship matrix (GRM), is a standard tool in computational genetics. The principal components of such matrices are routinely used to correct for biases in, for instance, linear regressions. However, the calculation of both a similarity matrix and its singular value decomposition (SVD) are computationally intensive. The contribution of this article is threefold. First, we demonstrate that the calculation of three matrices (the genomic covariance matrix, the weighted Jaccard matrix, and the genomic relationship matrix) can be reformulated in a unified way which allows for an exact, faster SVD computation. An exception is the Jaccard matrix, which does not have a structure applicable for the fast SVD computation. An exact algorithm is proposed to compute the principal components of the genomic covariance, weighted Jaccard, and genomic relationship matrices. The algorithm is adapted from an existing randomized SVD algorithm and ensures that all computations are carried out in sparse matrix algebra. Second, an approximate Jaccard matrix is introduced to which the fast SVD computation is applicable. Third, we establish guaranteed theoretical bounds on the distance (in L2 norm and angle) between the principal components of the Jaccard matrix and the ones of our proposed approximation, thus putting the proposed Jaccard approximation on a solid mathematical foundation. We illustrate all computations on both simulated data and data of the 1000 Genome Project, showing that the approximation error is very low in practice.

19

Fast Approximate IsoRank for Scalable Global Alignment of Biological Networks

Devkota, K.; Cowen, L. J.; Blumer, A.; Hu, X.

2023-03-15 bioinformatics 10.1101/2023.03.13.532445 medRxiv

Top 0.1%

8.5%

Show abstract

A well-studied approximate version of the graph matching problem is directly relevant for the study of protein-protein interaction networks. Called by the computational biology community Global Network Alignment, the two networks to be matched are derived from the protein-protein interaction (PPI) networks from organisms of two different species. If these two species evolved recently from a common ancestor, we can view the two PPI networks as a single network that evolved over time. It is the two versions of this network that we want to align using approximate graph matching. The first spectral method for the PPI global alignment problem proposed by the biological community was the IsoRank method of Singh et al. This method for global biological network alignment is still used today. However, with the advent of many more experiments, the size of the networks available to match has grown considerably, making running IsoRank unfeasible on these networks without access to state of the art computational resources. In this paper, we develop a new IsoRank approximation, which exploits the mathematical properties of IsoRanks linear system to solve the problem in quadratic time with respect to the maximum size of the two PPI networks. We further propose a computationally cheaper refinement to this initial approximation so that the updated result is even closer to the original IsoRank formulation. In experiments on synthetic and real PPI networks, we find that the results of our approximate IsoRank are not only nearly as accurate as the original IsoRank results but are also much faster, which makes the global alignment of large-scale biological networks feasible and scalable.

20

Geometry of Ranked Nearest Neighbour Interchange Space of Phylogenetic Trees

Collienne, L.; Elmes, K.; Fischer, M.; Bryant, D. J.; Gavryushkin, A.

2020-02-11 bioinformatics 10.1101/2019.12.19.883603 medRxiv

Top 0.1%

8.5%

Show abstract

AO_SCPLOWBSTRACTC_SCPLOWIn this paper we study the graph of ranked phylogenetic trees where the adjacency relation is given by a local rearrangement of the tree structure. Our work is motivated by tree inference algorithms, such as maximum likelihood and Markov Chain Monte Carlo methods, where the geometry of the search space plays a central role for efficiency and practicality of optimisation and sampling. We hence focus on understanding the geometry of the space (graph) of ranked trees, the so-called ranked nearest neighbour interchange (RNNI) graph. We find the radius and diameter of the space exactly, improving the best previously known estimates. Since the RNNI graph is a generalisation of the classical nearest neighbour interchange (NNI) graph to ranked phylogenetic trees, we compare geometric and algorithmic properties of the two graphs. Surprisingly, we discover that both geometric and algorithmic properties of RNNI and NNI are quite different. For example, we establish convexity of certain natural subspaces in RNNI which are not convex is NNI. Our results suggest that the complexity of computing distances in the two graphs is different.